Load the dataset
The dataset contains information about houses on sale in the Dublin area. Each house is one entry of the dataset and is described by mixed-type data: numerical, categorical and textual.
The goal is to combine the numerical/categorical features with the textual features to predict whether the house price is above or below 550,000.
The house price is determined by factors such as:
- location (area),
- surface (size),
- the number of bedrooms,
- the number of bathrooms,
- property type,
- house-features (size of the windows, construction material).
The physical attributes of the house, such as the number of bedrooms, the number of bathrooms, the surface, the property type and the location, are directly accessible from the dataset. By contrast, the house features can (sometimes only indirectly) be inferred from the house-description, house-facility and house-features.
You can download the dataset from this URL: https://github.com/benavoli/ST8003/tree/main/session05
A typical entry of the dataset is shown below.
data <- read.csv(file = '../session5/train.csv', sep = ",")
# pricerange is 1 when price is above 550000 and 0 otherwise
data['pricerange'] <- as.vector(data['price'] > 550000) + 0.0
data[1, ]  # show the first entry
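To make the definition of the response concrete, here is a minimal sketch of how the 0/1 indicator behaves, including what happens to a missing price. The `price` vector is made up for illustration and is not taken from the dataset.

```r
# Hypothetical prices, including one missing value
price <- c(320000, 550000, 725000, NA, 1200000)

# 1 when the price is strictly above 550000, 0 otherwise; NA stays NA
pricerange <- as.numeric(price > 550000)
print(pricerange)

# Share of each class, ignoring missing prices
class_share <- prop.table(table(pricerange))
print(class_share)
```

Note that a price of exactly 550000 falls in class 0, because the comparison is strict. Checking the class shares on the real training set is a quick way to see how far a model must go to beat the trivial "always predict the majority class" rule.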
Data Cleaning, Covariate selection and preprocessing
We select the columns (‘bathrooms’, ‘beds’, ‘surface’) that we will use as predictors of the price range, together with the response ‘pricerange’.
datasel <- data[c('bathrooms','beds','surface','pricerange')]
datasel <- na.omit(datasel)  # remove every row that contains a missing value (NA)
datasel
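Since `na.omit` drops every row with at least one `NA`, it can be worth checking how many values are missing per column before cleaning. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data standing in for the real training set
toy <- data.frame(
  bathrooms  = c(1, 2, NA, 3),
  beds       = c(2, 3, 4, NA),
  surface    = c(70, 110, 150, 200),
  pricerange = c(0, 0, 1, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit keeps only the rows with no missing value in any column
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

If a single column accounts for most of the dropped rows, leaving that column out of the model may preserve more training data than including it.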
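Logistic regression
Because the response pricerange is binary, we now fit a logistic regression model (a generalised linear model with binomial family).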
model = glm(pricerange ~ bathrooms + beds + surface, family = "binomial", data = datasel)
glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
Call:
glm(formula = pricerange ~ bathrooms + beds + surface, family = "binomial",
    data = datasel)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2076  -0.6062  -0.3426   0.5130   3.3574
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.646e+00  2.300e-01 -24.544  < 2e-16 ***
bathrooms    3.693e-01  6.577e-02   5.615 1.97e-08 ***
beds         1.223e+00  7.267e-02  16.824  < 2e-16 ***
surface      6.822e-05  2.379e-05   2.867  0.00415 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 2889.9  on 2401  degrees of freedom
Residual deviance: 2022.6  on 2398  degrees of freedom
AIC: 2030.6
Number of Fisher Scoring iterations: 5
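To help read this output, the sketch below recomputes a few quantities from the numbers printed in the summary: odds ratios, the predicted probability for one house, the AIC, a likelihood-ratio test against the intercept-only model, and McFadden's pseudo R-squared. The house used for the prediction (2 bathrooms, 3 beds, surface 100) is a hypothetical example, and the coefficients are typed in by hand rather than taken from a fitted object.

```r
# Coefficient estimates copied from the summary output
coefs <- c("(Intercept)" = -5.646, bathrooms = 0.3693,
           beds = 1.223, surface = 6.822e-05)

# exp(coefficient) is the multiplicative change in the odds of
# pricerange = 1 for a one-unit increase in that predictor
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Predicted probability for a hypothetical house:
# 2 bathrooms, 3 beds, surface 100 (in the dataset's surface units)
house <- c(1, 2, 3, 100)
p_house <- plogis(sum(coefs * house))
print(p_house)

# Deviances copied from the summary output
null_dev  <- 2889.9
resid_dev <- 2022.6

# For a binary response, AIC = residual deviance + 2 * number of coefficients
aic_check <- resid_dev + 2 * length(coefs)
print(aic_check)

# Likelihood-ratio test of the fitted model against the intercept-only model
# (3 extra coefficients, hence 3 degrees of freedom)
lr_stat   <- null_dev - resid_dev
lr_pvalue <- pchisq(lr_stat, df = 3, lower.tail = FALSE)
print(lr_pvalue)

# McFadden's pseudo R-squared
pseudo_r2 <- 1 - resid_dev / null_dev
print(pseudo_r2)
```

With a fitted model object, the same odds ratios can be obtained with `exp(coef(model))`. The very small surface coefficient reflects the scale of that column, so it should not be read as "surface barely matters" without looking at the range of surface values.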
Is this a good model? Can we use other columns in data to improve it? Can we include polynomial and/or interaction terms? Use the model selection approaches you learned in sessions 5 and 6 to find a better model.
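One possible way to approach these questions is to compare candidate models by AIC and to run a stepwise search. The sketch below does this on simulated data, so that it runs without the housing files; the data frame `sim`, its coefficients and the candidate formulas are all assumptions for illustration, and the approaches covered in the sessions may differ.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the housing data)
n <- 500
sim <- data.frame(
  beds      = rpois(n, 3),
  bathrooms = rpois(n, 2),
  surface   = runif(n, 40, 250)
)
# Simulated outcome that depends on beds and a beds:surface interaction
eta <- -4 + 0.6 * sim$beds + 0.004 * sim$beds * sim$surface
sim$pricerange <- rbinom(n, 1, plogis(eta))

# Baseline: main effects only
m_base <- glm(pricerange ~ bathrooms + beds + surface,
              family = binomial, data = sim)

# Candidate: all pairwise interactions plus a quadratic term in surface
m_full <- glm(pricerange ~ (bathrooms + beds + surface)^2 + I(surface^2),
              family = binomial, data = sim)

# Lower AIC is preferred
print(AIC(m_base, m_full))

# Backward stepwise selection by AIC, starting from the larger model
m_step <- step(m_full, direction = "backward", trace = 0)
print(formula(m_step))
```

On the real data, the same pattern applies with `data = datasel`. AIC is computed on the training data, so a held-out validation set or cross-validation is still useful to confirm that a more complex model actually predicts better.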
Unseen data
You can test the predictive performance of your best model on unseen data.
datatest <- read.csv(file = '../session5/test.csv', sep = ",")
# Rows 10 to 28 of the test set. There are 16 columns; the first two are just ids.
# The price column is not reported: you have to predict the price range for all the entries.
datatest[10:28, 3:16]
Prediction
predictions <- predict(model,datatest, type="response")
predictions[1:5]
1 2 3 4 5
0.4980193 0.2257686 0.5898662 0.3792981 0.1675038
These are the predicted probabilities that pricerange is 1 for the first 5 houses in the test set. You can save and submit your best predictions for our internal data science competition. The code below uses the threshold 0.5 to predict 1, that is, a house price above 550000.
# TRUE when the predicted probability exceeds 0.5 (price above 550000)
write.csv(as.array(predictions > 0.5), "name_surname.csv")
We will use the accuracy score to evaluate your predictions.
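Since the test prices are not available, one way to estimate the accuracy before submitting is to hold out part of the training data. The sketch below illustrates the idea on simulated data; the data frame `sim`, the 25% split and the variable names are assumptions for illustration, not part of the competition setup.

```r
set.seed(42)  # reproducible simulated data (hypothetical)
n <- 400
sim <- data.frame(beds = rpois(n, 3), bathrooms = rpois(n, 2))
sim$pricerange <- rbinom(n, 1,
                         plogis(-3 + 0.8 * sim$beds + 0.4 * sim$bathrooms))

# Hold out 25% of the rows as a validation set
val_idx <- sample(n, size = n / 4)
train <- sim[-val_idx, ]
valid <- sim[val_idx, ]

fit <- glm(pricerange ~ beds + bathrooms, family = binomial, data = train)

# Predicted probabilities, then class labels with the 0.5 threshold
prob <- predict(fit, newdata = valid, type = "response")
pred_class <- as.numeric(prob > 0.5)

# Accuracy = share of validation rows whose predicted label matches the truth
accuracy <- mean(pred_class == valid$pricerange)
print(accuracy)

# Confusion matrix for a more detailed view of the errors
print(table(predicted = pred_class, actual = valid$pricerange))
```

On the real data, the same split can be applied to `datasel`, which gives a way to compare candidate models by the metric that will be used for the evaluation.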